Experiments in TREC 2004 Novelty Track at CAS-ICT

نویسندگان

  • Huaping Zhang
  • Hongbo Xu
  • Shuo Bai
  • Bin Wang
  • Xueqi Cheng
چکیده

The main task in Novelty Track is to retrieve relevant sentences and remove duplicates from a document set given a TREC topic. This track took place for the first time in TREC 2002 and it is refined to four tasks in TREC 2003. Besides 25 relevant documents, irrelevant ones are given in this year of Novelty track. In other words, a given document is either relevant or irrelevant to the topic. There are 1808 documents in 50 TREC topics. Average 11.18 documents are noise for each topic. In topic N75, the number of noise is 45. Once we mistook an irrelevant document as relevance, all results in the document are wrong. Except the document retrieval, more limited information could be applied in the last three tasks than ever. Among the first 5 given documents, average 3.14 documents are relevant and average 2.76 are new. Especially, 9 topics have no relevant sentence in the first 5 ones. In TREC2004, ICT divided Novelty track into four sequential stages. It includes: customized language parsing on original dataset, document retrieval, sentence relevance and novelty detection. The architecture in novelty is given in Figure 1. In the first preprocessing stage, we applied sentence segmenter, tokenization, part-of-speech tagging, morphological analysis, stop word remover and query analyzer on topics and documents. As for query analysis, we categorized words in topics into description words and query words. Title, description and narrative parts are all merged into query with different weights. In the stage of document and sentence retrieval, we introduced vector space model (VSM) and its variance, probability model OKAPI and statistical language model. Based on VSM, we tried various query expansion strategies: pseu-feedback, term expansion with synset or synonym in WordNet[1] and expansion with highly local co-occurrence terms. With regard to the novelty stage, we defined three types of new degree: word overlapping and its extension, similarity comparison and information gain. In the last three tasks, we used the known results to adjust threshold, estimate the number of results, and turned to classifier, such as inductive and transductive SVM. Topic/Doc Parsing

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

TREC 2004 Web Track Experiments at CAS-ICT

This report presents CAS-ICT’s experiments on the Mixed query task of the TREC2004 Web track. Our work focused on combining different Web page evidences together to improve the overall retrieval performance. Four kinds of evidences, including body content(C), anchor texts (AT), basic structural information (S0) and extended structural information (S1) were considered for retrieval. Six combinat...

متن کامل

TREC-10 Experiments at CAS-ICT: Filtering, Web and QA

CAS-ICT took part in the TREC conference for the first time this year. We have participated in three tracks of TREC-10. For adaptive filtering track, we paid more attention to feature selection and profile adaptation. For web track, we tried to integrate different ranking methods to improve system performance. For QA track, we focused on question type identification, named entity tagging and an...

متن کامل

TREC 11 Experiments at CAS-ICT: Filtering and Web

CAS-ICT took part in the TREC conference for the second time this year and we undertook two tracks of TREC-11. For filtering track, we have submitted results of all three subtasks. In adaptive filtering, we paid more attention to undetermined documents processing, profile building and adaptation. In batch filtering and routing, a centroid-based classifier is used with preprocessed samples. For ...

متن کامل

Improved Feature Selection and Redundance Computing - THUIR at TREC 2004 Novelty Track

This is the third years that Tsinghua University Information Retrieval Group (THUIR) participates in Novelty task of TREC. Our research on this year’s novelty track mainly focused on four aspects: (1) text feature selection and reduction; (2) improved sentence classification in finding relevant information; (3)efficient sentence redundancy computing; (4) effective result filtering. All experime...

متن کامل

TREC 2003 Novelty and Web Track at ICT

In this paper, we will present our approaches and experiments on the following two tracks of TREC-2003: Novelty track and Web track. The novelty track can be treated as a binary classification problem: relevant sentences vs. irrelevant sentences, or new vs. non-new. In this way, we applied variants of techniques that have been employed for text categorization problem. To retrieve the relevant s...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004